A method for gene mapping from genotype and phenotype data
Field of the Invention

The present invention relates to a method for gene mapping from genotype and phenotype data, which method utilizes linkage disequilibrium between genetic markers m_i, which are polymorphic nucleic acid or protein sequences or strings of single-nucleotide polymorphisms deriving from a chromosomal region.
Background of the invention

The use of linkage disequilibrium (LD) in detecting disease genes has recently drawn much attention in genetic epidemiology. LD is evaluated with association analysis, which, when applied to disease-gene mapping, requires the comparison of allele or haplotype frequencies between the affected and the control individuals, under the assumption that a reasonable proportion of disease-associated chromosomes has been derived from a common ancestor. Traditional association analysis methods have long been used to test the involvement of candidate genes in diseases and, in special circumstances, to fine-map disease loci found by linkage methods. The testing has mostly been done using simple two-point measures.

Improved statistical methods to detect LD have been presented lately (Terwilliger 1995; Devlin et al. 1996; Lazzeroni 1998; McPeek and Strahs 1999; Service et al. 1999). The newer methods are based on statistical models of LD around a disease susceptibility (DS) gene. Genomic regions, rather than alleles, that are shared among affected individuals are searched for. The recombination history from the common ancestor to the present day is taken into account with more or less simplified statistical models. The power of these methods, as well as their ability to localize the correct position of the DS gene, has been shown to be better than that of traditional methods. Some of the models are robust against high levels of etiologic heterogeneity (McPeek and Strahs 1999; Service et al. 1999). However, the methods contain assumptions about the inheritance model of the disease and the structure of the survey population, and the effects of violations of these assumptions in real data are not known. In addition, they can only consider association of one region at a time. Thus, they are currently best suited for fine mapping rather than complex disease mapping or genome screening. The methods also tend to be computationally heavy.

The present inventors have recently introduced a so-called haplotype pattern mining (HPM) method (Toivonen et al. 2000a and 2000b). In the HPM method, haplotype patterns are ordered by their strength of association with the phenotype, and all haplotype patterns exceeding a given threshold level are used for prediction of the disease susceptibility gene location. The advantage of the HPM method is that it is model-free, as it does not require any assumptions about the inheritance model of the disease. The haplotype patterns are allowed to contain gaps, and therefore the HPM method is quite robust against mutations and against missing and erroneous data. However, the basis of the HPM method is that haplotypes, i.e. separate vectors of alleles of markers, are available. As will be explained below, this requirement causes various problems in gene mapping methods, and thus also in the HPM method.

Zhang et al. (2002) have extended the HPM method to allow simultaneous use of haplotype data of related individuals with a quantitative trait from an extended pedigree. This is done by employing the Quantitative Pedigree Disequilibrium Test (QPDT) statistic to measure the strength of association between a haplotype and a quantitative trait.

The standard procedure in association-based gene mapping is to 1) ascertain individuals carrying the trait of interest and their family members (at least parents), 2) genotype the individuals, 3) derive the haplotypes computationally using genotypes within families, and finally to 4) find associations in the haplotypes (gene mapping).

Even though the actual association analysis is done on sole case and control haplotypes, obtaining these haplotypes requires the parents of the affected individuals to be genotyped as well: the vast majority of haplotyping programs available expect the parental genotypes to exist. This means that the parents first have to be recruited, which is not always straightforward, as they might no longer be alive, cannot be reached, or refuse to give blood samples. Genotyping more individuals is laborious and elevates the study expenses: for every case or control, 3 individuals will be genotyped instead of just one, so genotyping is done on 3 times as many persons as there are cases and controls. In case the non-transmitted parental chromosomes can be used as controls, a case and his/her parents contribute one case-control pair, in which case the genotyping effort is 1.5 times higher than the number of cases and controls needed.

As an alternative to these haplotyping approaches, some methods for direct haplotyping from population-based data have indeed been presented, but the problem with these is that they still produce a lot of mistakes, which is a very bad starting point for any haplotype-based association program.
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possible taplotypcconfiguranOTS 13 2 (orl.iiN «;. r 

of h»r1nW' St"™? ^°^i^?^r»^^ is based on expHclfly «msi(toing 
St«to (1999). The approaA of Zlung and ^ f^^kTor ouAa map. of 

iBKrestins si»8 as was descnbed above W ^ ^ „ ^jet. 

dom sample ftom chromosome popuhflou "^-fl^^^ rf^*^^ ^ bo a «n- 
U aueie and d allele. Chtomoson«s ta normal «^«^"^.f^"„^„es. Next, a 
aom sample of chromoaome P-n?'^"'*^^ '^'^^/flf "^i p«se«edby 

rr: otnr^ ir^i.- - t^.r^rs ^^^^ 

^ ben^eu marker, and ""r* "^^f^'sths^Xnotype data (a, 

Zhang and Zhao p^enl) a. the '^^^"V^^^^L compatible with i. 
. ftr each gem^lypea toe ace sovcrfhaploOTe pa^ l^^ ?teHteUhMa 

for eui obscarvcd genotype w tuc sum oi x>,™mlaiDd as above. The ge- 

poadble ancestrd haplc^ P«™l°»«» I"'™* "'° ^ 
haplotype frequencies are estimated with a Markov model, and any which are below some prespecified level are left unconsidered.

The approach of Zhang and Zhao has the following serious drawbacks. First, the principle of Zhang and Zhao is to explicitly consider all possible haplotype configurations. This is feasible only with very small marker maps. Second, to avoid the first problem and to extend the applicability of the approach to larger maps, Zhang and Zhao apply additional pruning techniques to reduce the number of haplotype configurations they need to consider. However, those techniques are complex and error-prone. Third, their approach is based on summing probabilities of different haplotype configurations. Such an approach is not directly applicable to pattern-based mapping methods such as HPM.

Curtis et al. (2001) studied the use of an artificial neural network to detect association between disease and multiple marker genotypes. The pattern-recognition properties of the network were used in the hope that marker haplotypes implicit in the genotypes differed between cases and controls in a way which led to the network being able to classify the subjects correctly, according to their marker genotype.

Summary of the invention

The object of the present invention is to provide a model-free and computationally effective method allowing direct association analysis on genotype rather than haplotype data, which overcomes the above-mentioned drawbacks. The invention offers remarkable advantages by avoiding the technically difficult, costly and sometimes impossible steps of recruiting and genotyping family members, as well as by avoiding some of the error sources present in population-based haplotyping methods.

The above-mentioned object is achieved in accordance with the invention by the method for gene mapping from genotype and phenotype data, which utilizes linkage disequilibrium between genetic markers m_i, which are polymorphic nucleic acid or protein sequences or strings of single-nucleotide polymorphisms deriving from a chromosomal region. The method according to the invention is characterized by the following steps:

(i) all marker patterns P that satisfy a pattern evaluation function e(P) are searched from the data, wherein
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^ die maiier patterns are expressions involying the marlcer-aUele as- 
1^ or n^oTcf the foUowing: indi^dual covariau^, 
environmcaital variables and auxiliary phcnotypea; and 
b Oxc pattern evaluatioafmctione(P; involves some siarisn 

In^ studied, 

by tesdns each -"dLK of pMicm P aguna fto con^ponding oIlel= pair 

«i^»«ia= of O P "»"*^ 

as matches, - • 

flancOoa ufte set S, defined as the set of mfldcer P°f»^.";;«'>''K>^ 
jGucd iu step (i), aad 

15 HO fl«loo«tioaofthei^ is Indicted., a ttau^o^o^he.™^^^ 

aU*emsttosm, mlhcdal^ and is based mffla:umizmglhc »o«>f 

storing fonetiaa is designed ro give hister scores elosex K, *e ^ 

on mLri:n.B »o« If tl« scoring flmcUon is 

I L. ♦« iffineu OS is the case for instance when the 
'ZZ.'Zl^'^^^^ teon^er-resdable d^ sto. 

thereon. s»ld executable proB-u code b-S op«». 
, Srtoporii»mam«hodo£anyeml»dimeotsotthenwent.on«hen 

execviiteA on a compuler. 
A computer system according to the invention is programmed to perform the method of any embodiments of the invention.

As used herein the term "haplotype" defines a vector of alleles in a single chromosome. Also, as used herein the term "genotype" defines a vector of (unphased) allele pairs in a chromosome pair.
30 ll«t..-'-crosatemr.n3cddcfinesasmaUrun(^^^^^^^^ 

dem repeats of a very simple PNA sequence, usually M ^; ^ .J^m^i^ 

hosZused as the primary tool for geneticmappmg^^^^^^^^^ 

lie geneiic.locus' is a gene wiih higji level of vananon, there are several xyp 
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in the ge^ locus. cacH ^t. reasonably ^^^^j;-^ I^^^^^^^ 
Xoticie polyraoxpwsn, defines P^^^^^"^^^^^ edible to large- 
Although less informative than microsatellites, SNW ate 

scale niitOTnaied scoring. 
Brief description of the drawings

, .xf MPM-Q comoared to HPM: the y axis 
shows whidi iracUoii of simulated dala sets is in 

which is given on the X axis. ,™T,xn 

*■ c«t«nie size-on localizaHon accuwoy wiUi a) HFM-Q 

n . . J to /^io/^ 10%^ on localization accuracy wim 
associated aud 200 control chromosomes). 

Plgu.e4 showsaiccire«ofIOOpe«iu..tic^onlocali^onaccuracy. 
Detailed description of the invention

may also be a combfaaticm of several ptoKWes- 
^ .e«»a acco««ng to invention. 

„^od uses both S^-W^ ""T; „ ^ fieq»ncies of 

« genedc con«buaoo. aUccted ^f^tT;^^, Combi^ons 

associated maricer alleles nea, of affecsd indivWoaJs tlan 

of market alldes which ate moB ft«iaeiit m g«oiW« 

in genotype 

tiops .bo»l the mod. of inbentrnvce of .^e 4e 
the DS gene. The terms marker pattern and haplotype pattern denote the same concept and are used interchangeably in this text.

•l^e method ^cording to the present invenBon is an algorithm-hRed "^Basion of 
ttadition.1 a«ooiation andysis. It woAs ^ « non-panunctnc »odd^ 
5 ^r»y genedc modei Ue Io«,««H™. power of the me«hnd of the .ny e^m 
ovLVc^M. ^cto .nnMpte todepeodent founder n.«el|^^ 

Wee^ 5-15% at reoUstic sampte sizes (100 affected fadivid^ls md a sm.^ 
Of populatK^t co^tc.). - — ,te"S^"l= S - 

T.n ihertm-random associaHoa of maiKer alleles ^If'f^^".*!.^!^^ 
mily .» be strongest around the DS 8m»: consequently the locus » W^.*" ^ 
^^elst of tHe smm^est assudafiuo. are. In the HPM-Q 
,3 ^ invention. v«se«ch fa d»n=d,fl=xn>lehsplotypesfl« may contoi^^^^ 

^^rtwch ones are suoogly assoeiared vdfl. ^ disease "^'T^'^'^'J^ 
^-plnetric model tbr predicting the DS locus, on ^ «^the 
S«baplo.yp«. Petmutadon tests canbe used to conttast the rcaultaagamMtbonull 

hypothesis that tliere is no gene eflfent. 

Marker or Haplotype Patterns and Disease Association

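We examine linkage disequilibrium by mining marker or haplotype patterns. Given a marker map M with k markers m_1, ..., m_k, a "marker pattern" P on M is defined as a vector (p_1, ..., p_k), where each p_i is either an allele of marker m_i or the "don't care" symbol (*). The pattern P occurs in a given haplotype (chromosome) H = (h_1, ..., h_k) if p_i = h_i or p_i = * for all i, 1 <= i <= k. The pattern P occurs in a given genotype G = ({g_11, g_12}, ..., {g_k1, g_k2}) if p_i = g_i1 or p_i = g_i2 or p_i = * for all i, 1 <= i <= k.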

For example, consider a madter mep oif 10 matHets. Th« vednr Pi = 2. 5. 3. 
,„ . n where 1.2.3....sren«iriceraUeles.isanexampleofah^lmy^^ 

pLtloccora fainstence,machrom»somewiflrh^l«^(*,^^^^^^^^ 
It 3). -me pattern also occurs in the genotype ((2,3). (W>. i^-^^^'^J-J^^ 
li,' ). {W}. (1. 4}. {3.5}. (1. 6». (Por inrt««e. (2.5) is "f matter 

1 ; tlie alleles are 2 and 5. but llidr liliasea are not known.) 
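The occurrence rules for marker patterns can be illustrated with a minimal Python sketch. The names `occurs_in_haplotype` and `occurs_in_genotype`, the use of `None` for the "don't care" symbol, and the five-marker example values are all assumptions made for illustration; they are not taken from the specification.

```python
from typing import Optional, Sequence, Tuple

# None plays the role of the "don't care" symbol (*).
Pattern = Sequence[Optional[int]]
Haplotype = Sequence[int]
Genotype = Sequence[Tuple[int, int]]  # one unphased allele pair per marker


def occurs_in_haplotype(pattern: Pattern, haplotype: Haplotype) -> bool:
    """True if every fixed position of the pattern equals the haplotype allele."""
    return all(p is None or p == h for p, h in zip(pattern, haplotype))


def occurs_in_genotype(pattern: Pattern, genotype: Genotype) -> bool:
    """True if every fixed position of the pattern equals one allele of the pair."""
    return all(p is None or p in pair for p, pair in zip(pattern, genotype))


# Illustrative values only (hypothetical five-marker map).
pattern = (None, 2, 5, None, 3)
haplotype = (4, 2, 5, 1, 3)
genotype_hit = ((2, 3), (2, 5), (5, 1), (1, 4), (3, 5))
genotype_miss = ((2, 3), (1, 5), (5, 1), (1, 4), (3, 5))

print(occurs_in_haplotype(pattern, haplotype))
print(occurs_in_genotype(pattern, genotype_hit))
print(occurs_in_genotype(pattern, genotype_miss))
```

Because the genotype test ignores phase, a pattern can occur in a genotype even when neither of the two underlying chromosomes carries all of its alleles.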






Sl^al Vy d^cen. to rt» .i»«sB^«ct««i. la doing tbis. fl^ ^ T^^^^^j^ 
suK, wUi respect to flw .Nws of h^lotyp^ P-tteos: the genetx leng* of ft . s g 
^a^pattcms and gaps. We deft,e ^r^^^^'^l^^ 

^kL m(« wilUpj ^ * SeanlUng fa taplow" P«<«m» of 

Sr«^* lean no, in ^-enific^t «un*ers. Consequently, wl.ea Uaptotype prt- 
^^^t^hcdfor. thcn-^nuunlengd- of pattens tobe^^^ — 
0 stranedwlh an optional pffl«in-sear*pat«meter to ^ 

We allow fa gaps in tlie tnaiter patterns, stace mutations, gott coavorsion!, errors, 

^^„bin«i«« ean com,, contumo^ h^loW=es. Marker 
^a^^rs t«.ieally oa^ very shon gaps only. Missing i^"" 

consecutive r^l^ depending on .he dm eoHeCon schema '^.JJ' 
ts Ll« introlneedhy double reeon.binatio»s-.iacUhowev«. are ra«o^^^ 

short distances. In the HPM-G method, the maximum number and maximum length of gaps can be controlled with pattern search parameters.

Minmgl'tKoie.AssoclatedBviloVPeP''''^ 

25 evaluation flmcsdon e as aeflncd ia step (i). 

The output S of this phase is the set of those marker patterns that satisfy e, i.e., S = {P in U | e(P) is true}.

In step (ii), for each marker m_i in the data, let S_i be the set of those patterns in S that overlap with the marker m_i. In this phase each marker m_i is scored as a function of S_i, and the result is s(m_i).

In step (iii), the location of the gene is predicted as a function of the scores s(m_i) of all markers in the data. This function returns an area where the gene is predicted to be. The area can be contiguous or fragmented, and it can be a point in a special case.

Version 1 of the algorithm for marker pattern searching

space of partems, a sTanoara pwv,c- oattem sans- 

fi« to =val>>«ioa toeUon, thea aU mote geo«a p.nen., Mso Mti«fy it 

Input

• set U of possible marker patterns
• evaluation function e(P) for patterns P in U
• (generalization) relation < for patterns in U, where the function e and the relation < are such that if e(P) is true and P' < P, then e(P') is also true

Output

• set S = {P in U | e(P) is true} of patterns

Method

S := {}
// Initialize the set of evaluated patterns:
E := {}
// Start with the most general patterns:
Gen := {P in U | there is no P' in U, P' != P, such that P' < P}
// Recursively evaluate patterns in a depth-first order:
foreach P in Gen { evaluatePatterns(P) }
end

procedure evaluatePatterns(P) {
    insert P into the set E
    if e(P) = true then {
        insert P into set S
        // Find all specializations of P that have not been tested yet, and
        // evaluate them recursively:
        Spec := {P' in U - E | P < P', P' != P, and there is no P'' in U - E, P'' != P,
                 P'' != P', with P < P'' < P'}
        foreach P' in Spec { evaluatePatterns(P') }
    }
}
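The depth-first search can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the procedure of the specification: patterns are tuples with `None` for the "don't care" symbol, a specialization fixes exactly one further marker, and the toy evaluation function `frequent` (a plain frequency threshold on three made-up genotypes) is hypothetical. Unlike the Spec set above, the sketch only descends to immediate specializations that have not been evaluated yet.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple

Pattern = Tuple[Optional[int], ...]


def specializations(p: Pattern, alleles: Sequence[Sequence[int]]) -> List[Pattern]:
    """Immediate specializations of p: fix exactly one more marker."""
    out = []
    for i, value in enumerate(p):
        if value is None:
            for a in alleles[i]:
                out.append(p[:i] + (a,) + p[i + 1:])
    return out


def depth_first_search(alleles: Sequence[Sequence[int]],
                       e: Callable[[Pattern], bool]) -> Set[Pattern]:
    """Collect all patterns satisfying a monotone evaluation function e."""
    satisfied: Set[Pattern] = set()   # the set S
    evaluated: Set[Pattern] = set()   # the set E

    def evaluate(p: Pattern) -> None:
        evaluated.add(p)
        if e(p):
            satisfied.add(p)
            # Only descend below patterns that satisfy e (monotonicity).
            for q in specializations(p, alleles):
                if q not in evaluated:
                    evaluate(q)

    evaluate((None,) * len(alleles))  # the single most general pattern
    return satisfied


# Hypothetical toy data: three genotypes over two biallelic markers.
genotypes = [((1, 2), (1, 1)), ((1, 1), (1, 2)), ((2, 2), (2, 2))]
alleles = [(1, 2), (1, 2)]


def frequent(p: Pattern) -> bool:
    """Toy evaluation function: pattern occurs in at least two genotypes."""
    hits = sum(all(v is None or v in pair for v, pair in zip(p, g)) for g in genotypes)
    return hits >= 2


found = depth_first_search(alleles, frequent)
print(sorted(found, key=str))
```

The recursion depth is bounded by the number of markers, so for long marker maps an explicit stack may be preferable.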

Version 2 of the algorithm for marker pattern searching

foUowing -Igudfluu i, . simple S"-"; l^^''^^ S 
m of the method according to the invonHon. IHs based on dsp^-Stsc seara m m 

^ ^[.SXigaortaS !"«^e<j»m and.ihe«fi« smdsdoally 1«» noporM.t pa.- 



terns. 



Deto an auxiliary evaluadon ftnOioa ^e(P) whid. is m.e if a«d "-"ly 
Lu^ed ^owhoc) and replace the original «alaa..on ^^^^^^ ^ 

™^ patUs are siaTlsticaUy not retevant and Itercfuie Uttte .nfbnnation « lo« 

cal toq.lteatlon based on flie paUffln »yniax; i> <f .f and only if 
M li. alswithm nses the generalizaHon Kladon based on logical impUoatlon to sttua- 
Z space, Td fho .oxiliary ftaction to prune tlvc »areh space. All 

;:::^aX^^«ese^fi».buiunly OK^ealaosadaftring.a-cn^ 

Input

• set U of possible marker patterns
• evaluation function e(P) for patterns P in U
• frequency threshold x

Output

• set S = {P in U | e(P) and ae(P) is true} of patterns, where ae(P) is true if and only if the frequency of pattern P exceeds a given threshold x

Method

S := {}
// Initialize the set of evaluated patterns:
E := {}
// Start with the most general patterns:
Gen := {P in U | there is no P' in U, P' != P, such that P -> P'}
// Recursively evaluate patterns in a depth-first order:
foreach P in Gen { evaluatePatterns(P) }
end

procedure evaluatePatterns(P) {
    insert P into the set E
    if ae(P) = true then {
        if e(P) = true then insert P into set S
        // Find all specializations of P that have not been tested yet, and
        // evaluate them recursively:
        Spec := {P' in U - E | P' -> P, P' != P, and there is no P'' in U - E, P'' != P,
                 P'' != P', with P' -> P'' and P'' -> P}
        foreach P' in Spec { evaluatePatterns(P') }
    }
}

Version 3 of the algorithm for marker pattern searching

When phcnotype .ein. .luOied \^ZT> XC^^C^^^^^^^^ 

sociadon measure jt^ and x « a u.cr spccifaca rmm ^^^^^ suffidcnay 
that the sizes ot-A; are large finongh,siich as 7, to give siaas j^^^ 

and 1JSC5 lb to prune the seorch. 
Input

• marker map M = (m_1, ..., m_k)
• phenotype vector Y = (y_1, ..., y_n)
• genotypes (for each person and marker)
• association threshold x for chi-squared test
• maximum pattern length l
• maximum number of gaps g
• maximum gap size s

Output

• set S = {P in U | e(P) is true} of patterns
• where U consists of patterns on M that consist of marker-allele assignments and that adhere to parameters l, g, and s, and
• where e(P) is true if and only if the chi-squared test on P using the genotypes and phenotypes Y exceeds the given threshold x

Method

S := {}
// Number of case and control persons:
pi_A := number of affected persons;
pi_C := number of control persons;
pi := pi_A + pi_C;
// A lower bound for pattern frequency:
lb := pi_A * pi * x / (pi_C * pi + pi_A * x)
// Variable for iterating over different patterns:
P := (*, ..., *)
for i := 1 to k {
    // alleles(m_i) is the set of alleles of the i:th marker
    foreach a in alleles(m_i) {
        p_i := a
        // Test pattern P and all its extensions:
        checkPatterns(P, i, i, 0, 0)
        // Reset p_i:
        p_i := '*'
    }
}

// Test haplotype pattern P and all patterns that extend it
// from the right:
procedure checkPatterns(P, start, i, nr_of_gaps, gap_length) {
    [...]
    // Return if extended patterns would be too long:
    if i = k or i+1-start > l then return
    // Return if extended patterns can not be strongly disease-associated:
    if frequency of P in disease-associated persons is less than lb
        then return;
    // Create and test legal extensions of current pattern P (3 cases):
    // 1. Give marker i+1 all possible values:
    foreach a in alleles(m_{i+1}) {
        p_{i+1} := a
        checkPatterns(P, start, i+1, nr_of_gaps, 0)
    }
    // 2. Introduce a new gap starting at marker i+1:
    if p_i != '*' and nr_of_gaps < g and s >= 1 then {
        p_{i+1} := '*'
        checkPatterns(P, start, i+1, nr_of_gaps+1, 1)
    }
    // 3. Extend the current gap over marker i+1:
    if p_i = '*' and gap_length < s then {
        p_{i+1} := '*'
        checkPatterns(P, start, i+1, nr_of_gaps, gap_length+1)
    }
    // Before returning, reset p_{i+1}:
    p_{i+1} := '*'
    return
}
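The recursive extension scheme can be sketched in Python as below. This is an illustration under several assumptions, not the implementation of the specification: the function names, the 0-based indexing, the point at which a pattern is tested and stored (here: whenever its last position is a fixed allele), the reading of the length cutoff (no extension once the pattern spans `max_len` markers), and the toy case/control genotypes are all hypothetical choices.

```python
from typing import List, Optional, Sequence, Tuple

Genotype = Sequence[Tuple[int, int]]


def occurs(pattern, genotype) -> bool:
    return all(p is None or p in pair for p, pair in zip(pattern, genotype))


def signed_chi2(n_ap: int, n_a: int, n_cp: int, n_c: int) -> float:
    """Signed chi-squared of a 2x2 table; negative if more frequent in controls."""
    n, n_p = n_a + n_c, n_ap + n_cp
    n_n = n - n_p
    if 0 in (n_a, n_c, n_p, n_n):
        return 0.0
    chi2 = n * (n_ap * (n_c - n_cp) - (n_a - n_ap) * n_cp) ** 2 / (n_a * n_c * n_p * n_n)
    return chi2 if n_ap * n_c >= n_cp * n_a else -chi2


def mine_patterns(cases: List[Genotype], controls: List[Genotype],
                  alleles: Sequence[Sequence[int]], x: float,
                  max_len: int, max_gaps: int, max_gap_size: int):
    k, n_a, n_c = len(alleles), len(cases), len(controls)
    n = n_a + n_c
    lb = n_a * n * x / (n_c * n + n_a * x)  # lower bound for frequency in cases
    found = []
    p: List[Optional[int]] = [None] * k

    def check(start: int, i: int, nr_gaps: int, gap_len: int) -> None:
        n_ap = sum(occurs(p, g) for g in cases)
        if p[i] is not None:  # assumed: only test patterns ending in an allele
            n_cp = sum(occurs(p, g) for g in controls)
            if signed_chi2(n_ap, n_a, n_cp, n_c) >= x:
                found.append(tuple(p))
        if i == k - 1 or i + 1 - start >= max_len:
            return  # extensions would be too long
        if n_ap < lb:
            return  # extensions cannot be strongly disease-associated
        for a in alleles[i + 1]:                      # case 1: next allele
            p[i + 1] = a
            check(start, i + 1, nr_gaps, 0)
        if p[i] is not None and nr_gaps < max_gaps and max_gap_size >= 1:
            p[i + 1] = None                           # case 2: open a new gap
            check(start, i + 1, nr_gaps + 1, 1)
        if p[i] is None and gap_len < max_gap_size:
            p[i + 1] = None                           # case 3: extend the gap
            check(start, i + 1, nr_gaps, gap_len + 1)
        p[i + 1] = None                               # reset before returning

    for i in range(k):
        for a in alleles[i]:
            p[i] = a
            check(i, i, 0, 0)
            p[i] = None
    return found


# Hypothetical toy data: 4 cases, 4 controls, 3 biallelic markers.
cases = [((1, 2), (2, 2), (1, 1)), ((1, 1), (2, 1), (2, 2)),
         ((1, 2), (2, 2), (1, 2)), ((1, 1), (1, 2), (1, 1))]
controls = [((2, 2), (1, 1), (1, 2)), ((2, 2), (1, 1), (2, 2)),
            ((2, 2), (1, 1), (1, 1)), ((2, 2), (1, 1), (1, 2))]
result = mine_patterns(cases, controls, [(1, 2)] * 3, x=5.0,
                       max_len=3, max_gaps=1, max_gap_size=1)
print(result)
```

Recounting pattern occurrences at every node keeps the sketch short; a practical implementation would more likely carry the list of matching persons down the recursion.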

Version 4 of the algorithm for marker pattern searching

The following algorithm is a simple, generic, and efficient way to implement step (i) of the method according to the invention. It is based on the levelwise search method described in Mannila and Toivonen (1997).

Input

• set U of possible marker patterns
• evaluation function e(P) for patterns P in U
• (generalization) relation < for patterns in U, where the function e and the relation < are such that if e(P) is true and P' < P, then e(P') is also true

Output

• set S = {P in U | e(P) is true} of patterns

Definitions

• the least special specializations of pattern P are those P' in U, P' != P, with P < P', for which there is no P'' in U, P'' != P, P'' != P', with P < P'' < P'.

Method

S := {}
Q := {}
// Start with the most general patterns:
F := {P in U | there is no P' in U, P' != P, such that P' < P};
while F != {} {
    // Evaluate the candidate patterns:
    foreach P in F {
        if e(P) then insert P into set S
        else remove P from set F
    }
    Q := Q union F
    // Generate a new set of candidate patterns:
    C := {}
    foreach P in F {
        C := C union {P' in U | P' is a least special specialization of P, and every P''
                      of which P' is a least special specialization is in Q}
    }
    F := C
}
end
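A levelwise (breadth-first) variant of the search can be sketched in Python as follows. As before this is an illustration under assumptions: the helper names, the tuple encoding with `None` as the "don't care" symbol, and the toy frequency-based evaluation function are hypothetical. A candidate is generated only when all of its immediate generalizations have already been accepted.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple

Pattern = Tuple[Optional[int], ...]


def specializations(p: Pattern, alleles: Sequence[Sequence[int]]) -> List[Pattern]:
    """Least special specializations: fix exactly one more marker."""
    return [p[:i] + (a,) + p[i + 1:]
            for i, v in enumerate(p) if v is None for a in alleles[i]]


def generalizations(p: Pattern) -> List[Pattern]:
    """Immediate generalizations: release exactly one fixed marker."""
    return [p[:i] + (None,) + p[i + 1:] for i, v in enumerate(p) if v is not None]


def levelwise_search(alleles: Sequence[Sequence[int]],
                     e: Callable[[Pattern], bool]) -> Set[Pattern]:
    satisfied: Set[Pattern] = set()            # S
    accepted: Set[Pattern] = set()             # Q
    frontier: Set[Pattern] = {(None,) * len(alleles)}  # F, most general pattern
    while frontier:
        # Evaluate the candidates of this level and keep the ones that pass.
        frontier = {p for p in frontier if e(p)}
        satisfied |= frontier
        accepted |= frontier
        # Generate the next level from the surviving patterns.
        candidates: Set[Pattern] = set()
        for p in frontier:
            for q in specializations(p, alleles):
                if all(g in accepted for g in generalizations(q)):
                    candidates.add(q)
        frontier = candidates
    return satisfied


# Hypothetical toy data: three genotypes over two biallelic markers.
genotypes = [((1, 2), (1, 1)), ((1, 1), (1, 2)), ((2, 2), (2, 2))]


def frequent(p: Pattern) -> bool:
    hits = sum(all(v is None or v in pair for v, pair in zip(p, g)) for g in genotypes)
    return hits >= 2


level_result = levelwise_search([(1, 2), (1, 2)], frequent)
print(sorted(level_result, key=str))
```

Compared with the depth-first versions, each pattern is evaluated at most once and only after all of its generalizations have passed, at the cost of holding a whole level of candidates in memory.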

Version 5 of the algorithm for marker pattern searching

This is the levelwise search version of algorithm 2.

Input

• set U of possible marker patterns
• evaluation function e(P) for patterns P in U
• frequency threshold x

Output

• set S = {P in U | e(P) and ae(P) is true} of patterns, where ae(P) is true if and only if the frequency of pattern P exceeds a given threshold x

Definitions

• the least special specializations of pattern P are those P' in U, P' != P, with P' -> P, for which there is no P'' in U, P'' != P, P'' != P', with P' -> P'' and P'' -> P.

Method

S := {}
Q := {}
// Start with the most general patterns:
F := {P in U | there is no P' in U, P' != P, such that P -> P'};
while F != {} {
    // Evaluate the candidate patterns:
    foreach P in F {
        if ae(P) = true then {
            if e(P) = true then insert P into set S
        }
        else remove P from set F
    }
    Q := Q union F
    // Generate a new set of candidate patterns:
    C := {}
    foreach P in F {
        C := C union {P' in U | P' is a least special specialization of P, and every P''
                      of which P' is a least special specialization is in Q}
    }
    F := C
}
end
measure χ² and x is a user-specified minimum value) which is chosen so that the sizes of S_i are large enough, such as 7, to give statistically sufficiently reliable estimates for the gene locus, and the score s(m_i) of marker m_i is the size of S_i, also called the marker-wise pattern frequency of m_i and denoted by f(m_i).
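The marker-wise pattern frequency can be sketched in Python as below. The sketch assumes that a pattern overlaps every marker from its first to its last fixed position, gaps included; this reading, the function name `marker_scores`, and the example patterns are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Pattern = Tuple[Optional[int], ...]


def marker_scores(patterns: Sequence[Pattern], k: int) -> List[int]:
    """f(m_i): number of patterns in S whose span covers marker i (0-based)."""
    scores = [0] * k
    for p in patterns:
        fixed = [i for i, v in enumerate(p) if v is not None]
        if not fixed:
            continue  # the all-gap pattern covers no marker
        for i in range(fixed[0], fixed[-1] + 1):
            scores[i] += 1  # markers inside gaps are counted as covered
    return scores


# Hypothetical set S of strongly associated patterns on a four-marker map.
S = [(1, None, None, None), (1, 2, None, None), (None, 2, None, 1)]
scores = marker_scores(S, 4)
print(scores)
```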

As mentioned above, we use the (signed) χ² measure of marker-disease association. A signed version of the measure is used in order to discriminate disease association from control association. The signed measure ±χ²(P) of a haplotype pattern P is positive if P is more frequent in cases than in controls, and negative otherwise. Given a (positive) association threshold x, we say that P is "strongly associated" with the disease if ±χ²(P) ≥ x.

The first part of the HPM-G method can be described as follows. Given the data (markers M, genotypes G, and phenotypes Y), the task is to output all haplotype patterns P that are strongly associated with the disease status for a given value of the association threshold x. We denote the collection of all such haplotype patterns by S, that is, S = {P is a haplotype pattern on M | ±χ²(P) ≥ x}. If pattern parameters are specified (a maximum genetic length, a maximum number of gaps, or a maximum length for gaps), the task is refined by requiring that these additional restrictions are also fulfilled.

The signed χ² value is calculated from a 2x2 contingency table, where the rows correspond to the trait-association statuses of the persons, and the columns correspond to the presence and absence of the haplotype pattern. A pattern P = (p_1, ..., p_k) is present in a given genotype G = ({g_11, g_12}, ..., {g_k1, g_k2}) if p_i = g_i1 or p_i = g_i2 or p_i = * for all i, 1 <= i <= k. If, instead of a genotype, two haplotype vectors H_1 = (h_11, ..., h_1k) and H_2 = (h_21, ..., h_2k) are given for a person, pattern P is considered to be present in the person if it is present in either of the haplotypes, i.e., if either p_i = h_1i or p_i = * for all i, 1 <= i <= k, or p_i = h_2i or p_i = * for all i, 1 <= i <= k.

The value of the χ² statistic is computed normally, and a negative sign is attached if the relative frequency of the haplotype pattern among the control persons is higher than among the trait-associated persons.
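The signed statistic can be computed from the four cell counts of the 2x2 table, as in the following Python sketch. The function name and the convention of returning 0.0 for a degenerate table (an empty row or column) are assumptions made for illustration.

```python
def signed_chi2(n_ap: int, n_an: int, n_cp: int, n_cn: int) -> float:
    """Signed chi-squared statistic of a 2x2 table.

    n_ap / n_an: affected persons with / without the pattern,
    n_cp / n_cn: control persons with / without the pattern.
    """
    n_a, n_c = n_ap + n_an, n_cp + n_cn      # row totals
    n_p, n_n = n_ap + n_cp, n_an + n_cn      # column totals
    n = n_a + n_c
    if 0 in (n_a, n_c, n_p, n_n):
        return 0.0  # assumed convention for degenerate tables
    chi2 = n * (n_ap * n_cn - n_an * n_cp) ** 2 / (n_a * n_c * n_p * n_n)
    # Attach a negative sign if the pattern is relatively more frequent in controls.
    return -chi2 if n_cp / n_c > n_ap / n_a else chi2


# Hypothetical counts: pattern in 30 of 100 affected and 10 of 100 controls.
value = signed_chi2(30, 70, 10, 90)
mirrored = signed_chi2(10, 90, 30, 70)
print(value, mirrored)
```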

The first observation in solving the pattern-mining task is that given an association threshold x, a lower bound can be derived for the frequency of strongly associated haplotype patterns as follows:

Given a 2x2 contingency table of the numbers of disease-associated (A) and control (C) persons either matching a pattern (P) or not (N), the test statistic for the disease association of the pattern is defined by

\[ \chi^2 = \sum_{i \in \{A,C\}} \sum_{j \in \{P,N\}} \frac{(\pi_{ij} - \pi_i \pi_j / \pi)^2}{\pi_i \pi_j / \pi}, \]

where π_ij is the number of persons with properties i and j, π_i the number of persons with property i, and π the total number of persons. Given the number of affected persons (π_A), the number of control persons (π_C), and a lower bound x for the test statistic, we can derive a lower bound for the pattern frequency among the affected persons (π_AP) as follows. Assuming the pattern is disease-associated, we have π_AP · π_C > π_A · π_CP. The test statistic is maximized when π_CP = 0, implying π_AP = π_P and π_CN = π_C. Then

\[ x \le \chi^2 \le \frac{\pi\,(\pi_{AP}\,\pi_{CN})^2}{\pi_A\,\pi_C\,\pi_P\,\pi_N} = \frac{\pi\,\pi_{AP}\,\pi_C}{\pi_A\,(\pi - \pi_{AP})} \]

and

\[ \pi_{AP} \ge \frac{\pi_A\,\pi\,x}{\pi_C\,\pi + \pi_A\,x}. \]

The situation is symmetric for protective haplotypes, and the lower bound for π_CP is obtained by simply swapping π_A and π_C in the above result. If disease-associated and protective haplotypes are searched for at the same time, the smaller of the lower bounds for π_AP and π_CP can be used as a lower bound for π_P, making the implementation slightly simpler.
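The lower bound can be checked numerically with a short Python sketch. The function names and the sample sizes (100 affected, 100 controls, threshold 9) are hypothetical; the bound itself follows from setting the control count of the pattern to zero in the 2x2 chi-squared statistic.

```python
import math


def frequency_lower_bound(n_a: int, n_c: int, x: float) -> float:
    """Smallest pattern count among affected persons compatible with chi2 >= x."""
    n = n_a + n_c
    return n_a * n * x / (n_c * n + n_a * x)


def best_case_chi2(n_ap: int, n_a: int, n_c: int) -> float:
    """Chi-squared value when the pattern occurs in n_ap cases and no controls."""
    n = n_a + n_c
    return n * n_ap * n_c / (n_a * (n - n_ap))


lb = frequency_lower_bound(100, 100, 9.0)
min_count = math.ceil(lb)
print(lb, min_count)
print(best_case_chi2(min_count, 100, 100), best_case_chi2(min_count - 1, 100, 100))
```

In this example a pattern seen in fewer than 9 affected persons cannot reach the threshold even if it is absent from all controls, so it and all of its extensions can be pruned.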

On the other hand, given such a frequency threshold, all patterns exceeding the threshold can be enumerated efficiently with data-mining algorithms or a standard depth-first search method. An algorithm that first finds all haplotype patterns whose frequency exceeds the computed lower bound and then evaluates the association measure on them is guaranteed to find the exact set of strongly disease-associated patterns.

The approach is suitable for finding protective haplotype patterns by considering patterns P with ±χ²(P) ≤ −x. The derivation of the lower bound for the frequency among controls is identical to the case above. Obviously, both disease-associated and protective haplotypes can be found when |±χ²(P)| ≥ x.
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be, in additiau to qualftative. also quandtattve, ^"'^^^^ 
cemtation of a substance has a c««m 'jf":,^ ^^.-Jj ^^^ere is 
fonction e(W may have the fi»m e,?-; = <n« '/'^f/''^^^.2ecifi;d vatae. ^Mch 
fl.e absoluteftequency of patten, in the '^^^-^^^^^^^f^ticaily suf- 

strengfli of the method may be stUl increased. 

. , • f fnrm r = AAT, + + iMfc + «Z + i^' ^® dependent 
A linear model is of form r- A^i ^ •■- covariaxes such as envi- 

lomneotal fectots, and Z is a dummy ^^^'^ J ° . ^ secondly, the sig- 

P-»- ^'^^ -^r/asT^stdXS -r^rl Ph™^ is di- 
nificance of Z as a covanate is assessea oy u» s 

diotomous, then the logit transformation can be appUed. 
M^k^r scorns fn me case of^mimtt^phenot^ being studied 

1 locus are likely to have sttonget association ttian 

Haplotype patterns close to the DS 1°"^ ^^ "^^^^ ^ „ ^ «here most of d« 

"^l^^lfl, eS|lLe«dstrSiand«2ls»chthat;,r^*^;'„}l- 

mosomal region, potentiaUy identical oy a«.w w rneasurins ihe disease 

of marker dai^ Wle markers within gaps are ^^^^^^^^^ 
associationofthepatl^thewholechromosomalregionofthepaitemi^ gn 

be relevant. 






in S„ and „ is the e^ectatlon of the «h sn^llest p vatae. ^ P 
domly drawn ftomflieqmfonn distribution. 
Gene localization

The location of the gene, predicted as a function of the scores s(m_i) and based on maximizing or minimizing the score, is predicted to be

- the location of the marker m_i that maximizes or minimizes the marker score s(m_i), or

- the combination of most probable intervals for containing the trait-susceptibility locus that covers at most the desired proportion r (r ∈ [0, 100%]) of the original region, obtained by taking all such points in the studied chromosomal region whose nearest marker is within the k best scoring markers, where k is selected such that the resulting area has length at most r times the length of the studied region, and where k is the maximal such value, or

- those points in the studied chromosomal region whose nearest marker scores at least y or at most y, where y is scoring function dependent and is selected so that the probability of the gene being close to the marker is sufficiently large.

The location of the gene may also be determined by expert investigation of the marker scores or their visualization, e.g. as a curve.
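The first two localization rules can be sketched in Python as follows. The sketch assumes that higher scores are better, that marker positions are sorted along the studied region, and that the points nearest to a marker form the cell bounded by the midpoints to its neighbours; the function names and the five-marker example are hypothetical.

```python
from typing import List, Sequence, Tuple


def marker_cells(positions: Sequence[float], start: float, end: float) -> List[Tuple[float, float]]:
    """Interval of points whose nearest marker is marker i (positions sorted)."""
    mids = [(a + b) / 2 for a, b in zip(positions, positions[1:])]
    bounds = [start] + mids + [end]
    return list(zip(bounds, bounds[1:]))


def predict_region(positions: Sequence[float], scores: Sequence[float],
                   start: float, end: float, r: float) -> List[Tuple[float, float]]:
    """Cells of the k best scoring markers, with k maximal such that the
    total length is at most r times the length of the studied region."""
    cells = marker_cells(positions, start, end)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    budget, total, chosen = r * (end - start), 0.0, []
    for i in order:
        length = cells[i][1] - cells[i][0]
        if total + length > budget:
            break
        chosen.append(i)
        total += length
    # Merge adjacent cells into contiguous intervals.
    intervals: List[Tuple[float, float]] = []
    for i in sorted(chosen):
        lo, hi = cells[i]
        if intervals and intervals[-1][1] == lo:
            intervals[-1] = (intervals[-1][0], hi)
        else:
            intervals.append((lo, hi))
    return intervals


# Hypothetical example: five markers at 1 cM spacing on a 4 cM region.
positions = [0.0, 1.0, 2.0, 3.0, 4.0]
scores = [1, 5, 9, 4, 2]
best = max(range(len(scores)), key=lambda i: scores[i])   # point prediction
region = predict_region(positions, scores, 0.0, 4.0, r=0.5)
print(best, region)
```

For scores that should be minimized, such as marker-wise p values, the sort order and the `max` call would be reversed.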

Permutation Tests

More information about the significance of the observed scores may be obtained by permutation tests. The results obtained by considering the marker frequencies or the linear model, as explained earlier, can be contrasted against the null hypothesis that all the persons are drawn from the same distribution; that is, there is no gene effect in the disease status. We propose to permute randomly the status fields of the persons, keeping the proportions of affected and control persons constant, in a fashion similar to the method of Churchill and Doerge (1994). We approximate marker-wise p values using permutations and then predict the DS gene to be in the vicinity of the marker with the smallest empirical p value. Consecutive markers are dependent, and thus a large number of mutually dependent p values are produced. This is not a problem, since we do not use the p values for hypothesis testing, but only for ranking markers.

Marker-wise p values are used to re-score markers by their statistical unexpectedness. The test is carried out as follows: The phenotypes of the persons are randomly shuffled a number (thousands) of times. The scores are re-calculated for each permutation in turn. The marker-wise p value p(m_i) is the proportion of such permutation scores for marker m_i that are larger than or equal to the non-permuted score.

Each score s(m_i) is then refined by replacing it by the marker-wise p value p(m_i) of the score s(m_i).
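The permutation procedure can be sketched in Python as follows. The callable `score_fn`, which maps a vector of disease statuses to a list of marker scores, stands in for the whole pattern search and scoring pipeline; it, the function name, the fixed seed, and the toy carrier data are assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence


def permutation_p_values(statuses: Sequence[int],
                         score_fn: Callable[[Sequence[int]], List[float]],
                         n_perm: int = 1000, seed: int = 0) -> List[float]:
    """Marker-wise empirical p values: share of permutations whose score for
    a marker is at least as large as the non-permuted score."""
    rng = random.Random(seed)
    observed = score_fn(statuses)
    counts = [0] * len(observed)
    shuffled = list(statuses)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffling keeps the case/control proportions
        for i, (perm, obs) in enumerate(zip(score_fn(shuffled), observed)):
            if perm >= obs:
                counts[i] += 1
    return [c / n_perm for c in counts]


# Hypothetical toy scoring: number of affected carriers of one allele per marker.
carriers_a = [1, 1, 1, 1, 0, 0, 0, 0]   # carried exactly by the affected persons
carriers_b = [1] * 8                    # carried by everyone, uninformative


def toy_scores(status: Sequence[int]) -> List[float]:
    return [sum(s and c for s, c in zip(status, carriers))
            for carriers in (carriers_a, carriers_b)]


statuses = [1, 1, 1, 1, 0, 0, 0, 0]
p_values = permutation_p_values(statuses, toy_scores)
print(p_values)
```

The informative marker receives a small empirical p value, while the uninformative one receives 1.0, so ranking by p value places the informative marker first.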

Searching several genes

Several genes may be searched for simultaneously by using marker patterns that refer to several potential gene loci at the same time.

Examples

Certain embodiments and results of the present invention are described in the following non-limiting examples.

Example 1 - Simulated Data Sets 

We evaluated the performance of the proposed HPM-G method wiih simulated data 
SBis that coirespond lu a ncccudy fouuded, relatively isolated founder subpopula- 
ticin. Simulation of a population isolate was chosen, since it is recommended as the 
15 sludy population for LD studies. However, tlie method can be applied to any popu- 
lation that is suitable for LU analysis, since no assumptions are made about the 
population stiucture. 

An isolaled founder population, whidi grows from the initial size of 200 to 100,000 
individuals in. 20 generations, was simulated. 

The population pedigree was first generated assuming distinct generations and
exponential growth of the population size. In each generation, the parents of the
newborn individuals were randomly selected from members of the previous generation,
with the exception that whenever a parent with at least one child was chosen, his/her
spouse was always forced to become the other parent of the child. This procedure
generates family structure into each generation.

In the simulation of inheritance, each member of the first generation was assigned
to have one pair of homologous chromosomes. The genetic length of the chromosomes
was 100 cM for both males and females. Meiosis was repeatedly simulated,
and in each meiosis the number of crossover points was taken from a Poisson
distribution with parameter value 1, which corresponds to the total genetic length of
the chromosome. No chiasma interference was modeled. To accommodate the fact
that ever-increasing informativeness of marker maps may soon facilitate whole-[...]
marker intervals of exactly one cM. The remaining [...] frequencies in the founder
population were 04, thus [...] alleles. The polymorphism information content (PIC)
of each marker was fixed at 0.678.
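The meiosis step described above (a Poisson-distributed number of crossovers on a 100 cM chromosome, without chiasma interference) could be sketched as below. This is an illustrative sketch under stated assumptions, not the simulator used for the data sets: the function names, the uniform placement of crossover points, the random choice of the starting strand and the example marker map (one marker per cM from 0 to 100 cM) are assumptions made here.

```python
import math
import random


def poisson_sample(rng, mean):
    """Draw a Poisson-distributed integer using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_gamete(rng, hap_a, hap_b, positions_cm, length_cm=100.0):
    """Produce one recombinant chromosome from a pair of parental haplotypes.

    positions_cm lists the marker positions in increasing order. The number
    of crossovers has mean length_cm / 100 (1 for a 100 cM chromosome), and
    crossover points are placed uniformly (assumption: no interference).
    """
    n_crossovers = poisson_sample(rng, length_cm / 100.0)
    crossovers = sorted(rng.uniform(0.0, length_cm) for _ in range(n_crossovers))
    # Assumption: the strand copied first is chosen at random.
    current, other = (hap_a, hap_b) if rng.random() < 0.5 else (hap_b, hap_a)
    gamete = []
    next_crossover = 0
    for index, position in enumerate(positions_cm):
        # Switch strands at every crossover lying before this marker.
        while (next_crossover < len(crossovers)
               and crossovers[next_crossover] < position):
            current, other = other, current
            next_crossover += 1
        gamete.append(current[index])
    return gamete


rng = random.Random(42)
marker_positions = [float(i) for i in range(101)]  # assumed map: one marker per cM
child_chromosome = simulate_gamete(rng, ["A"] * 101, ["B"] * 101, marker_positions)
```

Labeling the two parental haplotypes with distinct symbols, as in the usage lines, makes the crossover points visible as switches between "A" and "B" in the result.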



To produce each of the 100 data sets for HPM-G and HPM, the processes of disease
locus selection, diagnosing and sampling were carried out independently. Next, these
processes are described.
[...] chromosomes present in the original [...] the disease [...]
carry a disease-causing mutation.

[...] probability of becoming affected [...] component is thought to contain
factors such as [...] indicator variables [...] number of individuals labeled as [...]
was [...] affected sample.

[...]izes genotype data, the control [...] genetic component. [...] to create
family-based pseudocontrol chromosomes. This was done by taking the alleles in the
non-transmitted chromosomal segments of the parents of each affected individual and
labeling them as control chromosomes. [...] the simulator as given, which corresponds
to error-free haplotyping and is expected to slightly favor HPM in the comparisons.

The simulation of missing data was based on the notion [...] laboratories there
seem to be two separate sources of missing data. First, [...] tend to cluster to
certain individuals, which can be a consequence of low [...] samples. Second,
certain markers may function poorly, thereby [...] missing genotypes. To mimic such
clustering [...] parameters: parameter α corresponds to the amount of missing data
[...] individual, and parameter β to the amount that clusters in markers. The missing
genotypes were selected using the following procedure:

For each individual i, a personal missing genotype probability x_i was computed
as the x value of the first random point in the (x, y) plane [...] inequality [...].
Having computed the value of variable x_i for the individual, each of his/her
genotypes was then labeled as missing with probability x_i. In the second phase,
the procedure was repeated for each marker. For each marker, a marker missing
probability was computed in an analogous fashion as the x value of the first random
point in the (x, y) plane [...] inequality [...], and each genotype corresponding to
the marker was labeled as missing independently for each individual with that
probability.

The values of variables α and β were empirically adjusted to produce fixed overall
levels of missing data. These values were 25 and 80 for 5%, and 13 and 40 for
10% of missing data.

Example 2 - Comparison to HPM

The localization accuracy was explored by plotting curves [...] the height of the
curve shows the fraction of data sets for which the localization was successful, as a
function of the length of the predicted region. [...] of 150 affected and 150 control
genotypes. The maximum length of a pattern was [...], and one gap of one marker was
allowed. The association threshold was set to 10. These numbers were based on
experimentation. For comparison, we also show the corresponding curve for HPM with
1/3 smaller sample size, and thus equal genotyping cost (figure 1). With HPM we used
association threshold 9; the parameters for the patterns were the same as those used
with HPM-G.

The results show that HPM-G has a high accuracy, and that it is extremely
competitive even in comparison to state-of-the-art methods that use explicitly
haplotyped data.

Example 3 - effect of sample size

The effect of sample size was examined by experimenting with sample sizes of
100+100, 150+150, 200+200 and 500+500 people (figure 2a). Figure 2b shows the
corresponding results for HPM.

HPM-G performs well even with only 100+100 genotypes. On the other hand, if
the amount of data is increased, the accuracy is improved.

Example 4 - effect of missing data

The influence of missing data was explored by randomly removing 5% or 10% of
marker genotypes (figure 3a). Figure 3b shows the corresponding results for HPM.

These results show that HPM-G is very robust against missing data.

Example 5 - Localization Accuracy with Permutation Tests

Permutation tests were used to obtain more information about the significance of
observed marker frequencies. Marker-wise P values were used to sort markers by
their statistical unexpectedness, not to test the statistical significance of the findings.
We performed the following experiment in order to see if the prediction accuracy
can be improved by permutation tests. We predicted the location of the DS gene to
be at the marker with the smallest P value instead of the most frequent marker. The
localization accuracy with 100 permutations compared to that without permutations
is shown in figure 4. The curves are almost identical, which is due to the evenly
distributed and identically informative markers.

The situation could be different with real marker data, where permutation tests are
likely to bring a greater benefit.

Claims

1. A method for gene mapping from genotype and phenotype data, which method
utilizes linkage disequilibrium between genetic markers m_i, which are polymorphic
nucleic acid or protein sequences or strings of single-nucleotide polymorphisms
deriving from a chromosomal region, characterized in that

i) all marker patterns P that satisfy a pattern evaluation function e(P) are
searched from the data, wherein

a) the marker patterns are expressions involving the marker-allele assignments
[...]

b) the pattern evaluation function involves some [...] studied,

by testing each marker of pattern P against the corresponding allele pair in
genotype G, effectively finding out if there is a possible haplotype configuration
of G which matches P, and counting the possible matches as matches,

ii) each marker of the data is scored by a marker score s(m_i), which is a function
of the set S_i defined as the set of [...] satisfying the pattern evaluation
function e as defined in step (i), and

iii) the location of the gene is predicted as a function of the scores s(m_i) of all the
markers m_i in the data and based on maximizing the score if the scoring
function is designed to give higher scores closer to the gene, and on minimizing
the score if the scoring function is designed to give lower scores closer to
the gene, as is the case for instance when the scores are marker-wise p
values.
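The genotype matching of step (i) can be illustrated with a short sketch. It assumes a representation chosen here for the example only: a pattern is a list with one entry per marker, either an allele or the gap symbol `"*"`, and a genotype is a list of unordered allele pairs. The names `matches` and `pattern_frequency` are hypothetical, and missing genotypes are not handled.

```python
GAP = "*"  # "don't care" symbol for markers the pattern does not constrain


def matches(pattern, genotype):
    """Return True if some haplotype configuration of the genotype matches.

    Since the phase of an unphased genotype is unknown, each marker can be
    tested on its own: the pattern allele must be one of the two alleles of
    the pair at that marker. If this holds for every constrained marker,
    a haplotype configuration carrying the whole pattern is possible.
    """
    for allele, pair in zip(pattern, genotype):
        if allele != GAP and allele not in pair:
            return False
    return True


def pattern_frequency(pattern, genotypes):
    """Count the genotypes that possibly match the pattern."""
    return sum(1 for genotype in genotypes if matches(pattern, genotype))


pattern = [GAP, 2, GAP, 1]
genotypes = [
    [(1, 1), (2, 3), (4, 4), (1, 5)],  # possible match: 2 and 1 are present
    [(1, 1), (3, 3), (4, 4), (1, 5)],  # no match: allele 2 absent at marker 2
]
count = pattern_frequency(pattern, genotypes)
```

Possible matches are counted as matches, so the count is an upper bound on the number of persons who actually carry the pattern on a single chromosome.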

2. A method according to claim 1, characterized in that a marker is scored as
the sum of the weights of overlapping patterns.
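Scoring a marker as the sum of the weights of overlapping patterns could look like the sketch below. The notion of overlap used here is an assumption made for the example: a pattern is taken to overlap every marker from its first to its last constrained position, gaps included. The names `span` and `marker_scores` are hypothetical.

```python
GAP = "*"


def span(pattern):
    """Return the first and last constrained marker index of a pattern.

    Assumes the pattern constrains at least one marker.
    """
    fixed = [i for i, allele in enumerate(pattern) if allele != GAP]
    return fixed[0], fixed[-1]


def marker_scores(patterns, weights, n_markers):
    """Score each marker by the summed weights of the patterns overlapping it."""
    scores = [0.0] * n_markers
    for pattern, weight in zip(patterns, weights):
        first, last = span(pattern)
        for i in range(first, last + 1):
            scores[i] += weight
    return scores


patterns = [[1, GAP, 2, GAP], [GAP, 3, GAP, GAP]]
scores = marker_scores(patterns, weights=[2.0, 1.5], n_markers=4)
```

With all weights equal to 1 the score reduces to the number of overlapping patterns, that is, the marker-wise pattern frequency.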

3. A method according to claim 2, characterized in that the weight of a pattern is
a function of

- the uncertainty of matching, e.g. [...], where N[i] is the number of
heterozygous markers within the pattern in genotype i, summed over all
matched genotypes, or

- the informativeness of the pattern, e.g. 2^H, where H is the average
heterozygosity within the pattern, or

- the strength of association, e.g. chi squared.
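As an illustration of the last option, the strength of association of a pattern can be measured with the standard chi-squared statistic of a 2 x 2 table (pattern present or absent versus affected or control). The sketch below uses the textbook formula without continuity correction; the function names and the sign convention of the signed variant are assumptions made for this example.

```python
def chi_squared_2x2(affected_with, affected_without, control_with, control_without):
    """Pearson chi-squared statistic of a 2 x 2 table (no continuity correction)."""
    a, b, c, d = affected_with, affected_without, control_with, control_without
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a margin is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denominator


def signed_chi_squared(affected_with, affected_without, control_with, control_without):
    """Chi-squared made negative when the pattern is rarer among the affected.

    Assumed sign convention: positive means over-representation in the
    affected group, which holds exactly when a * d > b * c.
    """
    a, b, c, d = affected_with, affected_without, control_with, control_without
    value = chi_squared_2x2(a, b, c, d)
    return value if a * d >= b * c else -value


weight = chi_squared_2x2(10, 0, 0, 10)  # perfect association in 20 persons
```

A signed measure keeps only patterns that are over-represented among the affected when it is compared against a positive threshold.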

4. A method of claim 1, characterized in that marker patterns P are searched
for by the following algorithm:

Input

• set U of possible marker patterns

• evaluation function e(P) for patterns P in U

• (generalization) relation < for patterns in U

• where the function e and the relation < are such that if e(P) is true and P' < P,
then e(P') is also true

Output

• set S = {P in U | e(P) is true} of patterns

Method

1. S := {}
2. // Initialize the set of evaluated patterns:
3. E := {}
4. // Start with the most general patterns:
5. Gen := {P in U | there is no P' in U, P' != P, such that P' < P}
6. // Recursively evaluate patterns in a depth-first order:
7. foreach P in Gen { evaluatePatterns(P) }
8. end;

9. procedure evaluatePatterns(P) {
10.   insert P into the set E
11.   if e(P) = true then {
12.     insert P into set S
13.     // Find all specializations of P that have not been tested yet, and
14.     // evaluate them recursively:
15.     Spec := {P' in U - E | P < P', P' != P, and there is no P'' in U - E, P'' != P
16.       and P'' != P', with P < P'' < P'};
17.     foreach P' in Spec { evaluatePatterns(P'); }
18.   }
19. }
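The depth-first search above can be made concrete with a small sketch. Several simplifications are assumptions made here, not part of the claim: the data are phased haplotypes (tuples of alleles), a pattern is a frozenset of (marker index, allele) assignments, the most general pattern is the empty one, a specialization fixes one additional marker, and e(P) is a plain frequency test, which satisfies the required property that every generalization of an accepted pattern is also accepted.

```python
def depth_first_search(haplotypes, min_count):
    """Return all patterns matched by at least min_count haplotypes."""
    n_markers = len(haplotypes[0])
    # All marker-allele assignments that occur in the data.
    assignments = sorted({(i, h[i]) for h in haplotypes for i in range(n_markers)})

    def e(pattern):
        # Evaluation function: frequency of the pattern reaches the threshold.
        count = sum(all(h[i] == a for i, a in pattern) for h in haplotypes)
        return count >= min_count

    evaluated = set()  # the set E of patterns already evaluated
    accepted = set()   # the set S of patterns satisfying e

    def evaluate_patterns(pattern):
        evaluated.add(pattern)
        if not e(pattern):
            return  # no specialization of a rejected pattern can satisfy e
        accepted.add(pattern)
        used_markers = {i for i, _ in pattern}
        for i, allele in assignments:
            if i in used_markers:
                continue
            child = pattern | {(i, allele)}  # fix one more marker
            if child not in evaluated:
                evaluate_patterns(child)

    evaluate_patterns(frozenset())  # start from the most general pattern
    return accepted


haplotypes = [(1, 2, 1), (1, 2, 2), (1, 3, 1), (2, 2, 1)]
found = depth_first_search(haplotypes, min_count=2)
```

The set of evaluated patterns prevents a pattern from being tested twice when it can be reached through several generalizations.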

5. A method of claim 1, characterized in that the marker patterns P are searched
for by the following algorithm:

Input

• set U of possible marker patterns

• evaluation function e(P) for patterns P in U

• frequency threshold x

Output

• set S = {P in U | e(P) and ae(P) is true} of patterns, where ae(P) is true if and
only if the frequency of pattern P exceeds a given threshold x

Method

20. S := {}
21. // Initialize the set of evaluated patterns:
22. E := {}
23. // Start with the most general patterns:
24. Gen := {P in U | there is no P' in U, P' != P, such that P -> P'}
25. // Recursively evaluate patterns in a depth-first order:
26. foreach P in Gen { evaluatePatterns(P) }
27. end

28. procedure evaluatePatterns(P) {
29.   insert P into the set E
30.   if ae(P) = true then {
31.     if e(P) = true then insert P into set S
32.     // Find all specializations of P that have not been tested yet, and evaluate
33.     // them recursively:
34.     Spec := {P' in U - E | P' -> P, P' != P, and there is no P'' in U - E, P'' != P
35.       and P'' != P', with P' -> P'' and P'' -> P}
36.     foreach P' in Spec { evaluatePatterns(P') }
37.   }
38. }

6. A method of claim 1, characterized in that the marker patterns P are searched
for by the following algorithm:

Input

• marker map M

• phenotype vector Y = (Y_1, ..., Y_n)

• genotype matrix H of size n x k x 2 (n persons, k markers, 2 alleles per person
and marker)

• association threshold x for chi-squared test

• maximum pattern length l

• maximum number of gaps g

• maximum gap size s

Output

• set S = {P in U | e(P) is true} of patterns,

• where U consists of patterns on M that consist of marker-allele assignments and
that adhere to parameters l, g and s, and

• where e(P) is true if and only if chi-squared test on P using genotype matrix H
and phenotypes Y exceeds the given threshold x

Method

39. S := {}
40. // Number of case and control persons:
41. [...] := number of affected persons;
[...]
44. // A lower bound for pattern frequency:
[...]
46. // Variable for iterating over different patterns:
47. P = (p_1, ..., p_k) := ('*', ..., '*')
48. for i := 1 to k {
49.   // alleles(m_i) is the set of alleles of the ith marker
50.   foreach a in alleles(m_i) {
51.     p_i := a
52.     // Test pattern P and all its extensions:
53.     checkPatterns(P, i, i, 0, 0)
54.     // Reset p_i:
55.     p_i := '*'
56.   }
57. }

58. end

59. // Test haplotype pattern P and all patterns that can be generated by extending P
60. // from the right:
61. procedure checkPatterns(P, start, i, nr_of_gaps, gap_length) {
62.   // Output strongly associated patterns
63.   if chi-squared(P, M, H, Y) >= x and p_i != '*' then insert P into set S
64.   // Return if extended patterns would be too long:
65.   if i = k or i+1-start > l then return
66.   // Return if extended patterns can not be strongly disease-associated:
67.   if frequency of P in affected persons is less than lb
68.     then return;
69.   // Create and test legal extensions of current pattern P (3 cases):
70.   // 1. Give marker i+1 all possible values:
71.   foreach a in alleles(m_(i+1)) {
72.     p_(i+1) := a
73.     checkPatterns(P, start, i+1, nr_of_gaps, 0)
74.   }
75.   // 2. Introduce a new gap starting at marker i+1:
76.   if p_i != '*' and nr_of_gaps < g and s >= 1 then {
77.     p_(i+1) := '*'
78.     checkPatterns(P, start, i+1, nr_of_gaps + 1, 1)
79.   }
80.   // 3. Extend the current gap over marker i+1:
81.   if p_i = '*' and gap_length < s then {
82.     p_(i+1) := '*'
83.     checkPatterns(P, start, i+1, nr_of_gaps, gap_length + 1)
84.   }
85.   // Before returning, reset p_(i+1):
86.   p_(i+1) := '*'
87.   return
88. }

7. A method of claim 1, characterized in that the marker patterns P are searched
for by the following algorithm:

Input

• set U of possible marker patterns

• evaluation function e(P) for patterns P in U

• (generalization) relation < for patterns in U, where the function e and the
relation < are such that if e(P) is true and P' < P, then e(P') is also true

Output

• set S = {P in U | e(P) is true} of patterns

Definitions

• function Lgg: U -> 2^U, Lgg(P) = {P' in U | P > P' and P' != P and there is no
P'' in U such that P != P'' != P' and P > P'' > P'}, the set of least general
generalizations of pattern P.

• function Lss: U -> 2^U, Lss(P) = {P' in U | P < P' and P' != P and there is no
P'' in U such that P != P'' != P' and P < P'' < P'}, the set of least special
specializations of pattern P.

Method

89. Q := {}
90. S := {}
91. // Start with the most general patterns:
92. F := {P in U | there is no P' in U, P' != P, such that P' < P};
93. while F != {} {
94.   // Evaluate the candidate patterns:
95.   foreach P in F {
96.     if e(P) = true then insert P into set S
97.     else remove P from set F
98.   }
99.   Q := Q union F
100.   // Generate a new set of candidate patterns:
101.   C := {}
102.   foreach P in F {
103.     C := C union {P' in U | P' in Lss(P) and for all P'' in Lgg(P'):
104.       P'' in Q}
105.   }
106.   F := C
107. }
108. end

8. A method of claim 1, characterized in that the marker patterns P are searched
for by the following algorithm:

Input

• set U of possible marker patterns

• evaluation function e(P) for patterns P in U

• frequency threshold x

Output

• set S = {P in U | e(P) and ae(P) is true} of patterns, where ae(P) is true if and
only if the frequency of pattern P exceeds a given threshold x

Definitions

• function Lgg: U -> 2^U, Lgg(P) = {P' in U | P -> P' and P' != P and there is no
P'' in U such that P != P'' != P' and P -> P'' -> P'}, the set of least general
generalizations of pattern P.

• function Lss: U -> 2^U, Lss(P) = {P' in U | P' -> P and P' != P and there is no
P'' in U such that P != P'' != P' and P' -> P'' -> P}, the set of least special
specializations of pattern P.

Method

109. S := {}
110. Q := {}
111. // Start with the most general patterns:
112. F := {P in U | there is no P' in U, P' != P, such that P -> P'};
113. while F != {} {
114.   // Evaluate the candidate patterns:
115.   foreach P in F {
116.     if ae(P) = true then {
117.       if e(P) insert P into set S
118.     }
119.     else remove P from set F
120.   }
121.   Q := Q union F
122.   // Generate a new set of candidate patterns:
123.   C := {}
124.   foreach P in F {
125.     C := C union {P' in U | P' in Lss(P) and for all P'' in Lgg(P'):
126.       P'' in Q}
127.   }
128.   F := C
129. }
130. end
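The level-wise (breadth-first) candidate generation used in the two preceding algorithms can be sketched as follows. The sketch makes the same simplifying assumptions as a toy setting would: phased haplotypes as data, patterns as frozensets of (marker index, allele) assignments, and a single frequency test in the role of the evaluation function. The names `levelwise_search`, `lss` and `lgg` are hypothetical helpers standing in for Lss and Lgg; with a separate frequency test ae(P) and association test e(P), only the filtering step would differ.

```python
def levelwise_search(haplotypes, min_count):
    """Breadth-first search for patterns matched by at least min_count haplotypes."""
    n_markers = len(haplotypes[0])
    assignments = sorted({(i, h[i]) for h in haplotypes for i in range(n_markers)})

    def e(pattern):
        count = sum(all(h[i] == a for i, a in pattern) for h in haplotypes)
        return count >= min_count

    def lss(pattern):
        # Least special specializations: fix exactly one more marker.
        used = {i for i, _ in pattern}
        return {pattern | {(i, a)} for i, a in assignments if i not in used}

    def lgg(pattern):
        # Least general generalizations: free exactly one fixed marker.
        return {pattern - {item} for item in pattern}

    accepted = set()             # the output set S
    passed = set()               # the set Q of patterns that passed so far
    frontier = {frozenset()}     # the candidate set F, most general first
    while frontier:
        # Evaluate the candidates and drop those that fail.
        frontier = {p for p in frontier if e(p)}
        accepted |= frontier
        passed |= frontier
        # A new candidate is kept only if all its immediate generalizations passed.
        candidates = set()
        for pattern in frontier:
            for child in lss(pattern):
                if all(parent in passed for parent in lgg(child)):
                    candidates.add(child)
        frontier = candidates
    return accepted


haplotypes = [(1, 2, 1), (1, 2, 2), (1, 3, 1), (2, 2, 1)]
found = levelwise_search(haplotypes, min_count=2)
```

Compared with a depth-first search, the level-wise order evaluates fewer patterns, because a candidate is generated only after all of its immediate generalizations are known to have passed.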

9. A method of claim 1, characterized in that

a) the phenotype being studied is qualitative, and

b) the pattern evaluation function e(P) has the form e(P) = true if and only if
e'(P) > x, where e'(P) is the (signed) association measure and x is a user-
specified minimum value, which is chosen so that the sizes of S_i are large
enough, such as 7, to give statistically sufficiently reliable estimates for the
gene locus, and

c) the score s(m_i) of marker m_i is the size of S_i, also called marker-wise pattern
frequency of m_i and denoted by f(m_i).

10. A method of claim 1, characterized in that

a) the pattern evaluation function has the form e(P) = true if and only if
e'(P) > x, where e'(P) is the absolute frequency of pattern P in the data, and
x is a user-specified value, which is chosen so that the sizes of S_i are large
enough, such as 7.0, to give statistically sufficiently reliable estimates for
the gene locus, and

b) in order to derive the score s(m_i), the p value (statistical significance) of
each marker pattern P in determining the phenotype being studied is evaluated,
and

c) the score s(m_i) is the distance between the observed p value distribution of
patterns in S_i and the uniform distribution, defined as average of (p_j - q_j)
log (p_j / q_j) over all j = 1, ..., n, where n is the number of haplotype patterns in
S_i, p_j is the jth smallest p value in S_i, and q_j is the expectation of the jth
smallest p value, if the p values were randomly drawn from the uniform
distribution.
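The distance of step c) can be computed as in the sketch below. It relies on one standard fact that is not spelled out above: the expected jth smallest of n independent draws from the uniform distribution on (0, 1) is j / (n + 1). The function name and the use of the natural logarithm are assumptions made for this example, and all p values are assumed to be strictly positive.

```python
import math


def uniform_distance_score(p_values):
    """Average of (p_j - q_j) * log(p_j / q_j) over the sorted p values.

    p_j is the j-th smallest observed p value and q_j = j / (n + 1) is the
    expected j-th smallest of n uniform(0, 1) draws. Each term is
    non-negative and is zero only when p_j equals q_j.
    """
    ordered = sorted(p_values)
    n = len(ordered)
    total = 0.0
    for j, p in enumerate(ordered, start=1):
        q = j / (n + 1)
        total += (p - q) * math.log(p / q)
    return total / n


flat_score = uniform_distance_score([1 / 3, 2 / 3])      # matches the expectation
skewed_score = uniform_distance_score([0.001, 0.002])    # many small p values
```

A marker surrounded by patterns with unexpectedly small p values therefore receives a high score, so this scoring function is one to be maximized.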

11. A method of claim 10, characterized in that the p value is computed using a
linear model of form Y = β_1 X_1 + ... + β_k X_k + αZ + A, where the dependent variable
Y is the phenotype being studied, X_1 through X_k are covariates, such as environmental
factors, and Z is a dummy variable for the occurrence of the haplotype pattern, and

the coefficients α and β_i are adjusted for best fit, and then

the significance of Z as a covariate is assessed by using a t test with the null
hypothesis "α = 0".

12. A method of claim 1, characterized in that each score s(m_i) is refined by
replacing it by the marker-wise p value of the score s(m_i), where the statistical
significance of s(m_i) is measured against the null hypothesis that there is no gene effect.

13. A method of claim 12, characterized in that the marker-wise p values p(m_i)
are determined by randomly permuting phenotypes.

14. A method of claim 1, characterized in that the area returned from the
prediction of gene location is contiguous or fragmented or a point.

15. A method of claim 1, characterized in that the location of the gene, predicted
as a function of the scores s(m_i) and based on maximizing or minimizing the score,
is predicted to the location of the marker m_i that maximizes or minimizes the
marker score s(m_i).

16. A method of claim 1, characterized in that the location of the gene, predicted
as a function of the scores s(m_i) and based on maximizing or minimizing the score,
is predicted to the combination of most probable intervals for containing the trait-
susceptibility locus that covers at most desired proportion l (l ∈ (0, 100%]) of the
original region obtained by taking all such points in the studied chromosomal region
whose nearest marker is within the k best scoring markers, where k is selected such
that the resulting area has length at most l times the length of the studied region, and
where k is maximal such value.
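The region prediction of this claim can be sketched as below. The sketch assumes that the studied region runs from the first to the last marker and that the markers are given in map order; both are assumptions made for the example, as are the names `predict_region`, `fraction` and `maximize`. Each marker owns the points that are closer to it than to any other marker, and the best-scoring markers are added one by one while the total length stays within the allowed proportion.

```python
def predict_region(positions, scores, fraction, maximize=True):
    """Return the intervals owned by the k best-scoring markers.

    k is the largest number of top-ranked markers whose combined intervals
    cover at most `fraction` of the studied region. The returned intervals
    are sorted by position and may be contiguous or fragmented.
    """
    n = len(positions)
    start, end = positions[0], positions[-1]
    # Interval of marker i: all points whose nearest marker is marker i.
    bounds = [start]
    bounds += [(positions[i] + positions[i + 1]) / 2 for i in range(n - 1)]
    bounds += [end]
    cells = [(bounds[i], bounds[i + 1]) for i in range(n)]
    # Rank markers from best to worst score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=maximize)
    budget = fraction * (end - start)
    chosen, used = [], 0.0
    for i in order:
        length = cells[i][1] - cells[i][0]
        if used + length > budget:
            break  # adding the next best marker would exceed the proportion
        chosen.append(i)
        used += length
    return sorted(cells[i] for i in chosen)


region = predict_region([0, 1, 2, 3, 4], [1, 5, 9, 4, 2], fraction=0.5)
```

Setting `maximize=False` ranks the markers by ascending score, which corresponds to scores that are marker-wise p values.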

17. A method of claim 1, characterized in that the location of the gene, predicted
as a function of the scores s(m_i) and based on maximizing or minimizing the score,
is predicted to those points in the studied chromosomal region whose nearest
marker scores at least or at most y, where y is scoring function dependent and is
selected so that the probability of the gene being close to the marker is sufficiently
large.

18. A method of claim 1, characterized in that the location of the gene, predicted
as a function of the scores s(m_i) and based on maximizing or minimizing the score,
is determined by expert investigation of the marker scores or their visualization.

19. A method of claim 1, characterized in that several genes are searched for
simultaneously by using patterns that refer to several potential gene loci at the
same time.



20. A computer-readable data storage medium having [...]
of the preceding claims when executed on a computer.

21. A computer system, characterized in that it is programmed to perform the
method of any of the claims 1 to 19.

(57) Abstract

The present invention relates to a method for gene mapping
from genotype and phenotype data, which method utilizes
linkage disequilibrium between genetic markers m_i, which
are polymorphic nucleic acid or protein sequences or
strings of single-nucleotide polymorphisms deriving from a
chromosomal region. All marker patterns P that satisfy a
certain pattern evaluation function e(P) are searched from
the data, each marker m_i of the data is scored by a marker
score and the location of the gene is predicted as a function
of the scores s(m_i) of all the markers m_i in the data.


